An Analysis of Salaries and Cost of Living in Different US Cities

By James Brunner and Vyoma Jani

Introduction

As computer science majors, we hear lots of stories about people working their dream jobs in Silicon Valley. Top companies and high pay definitely seem appealing! But we've also heard that the Bay Area is a very expensive place to live, so is it really worth moving there for the high pay?

This tutorial explores the salaries, total years of experience, and other factors for a number of computer scientists across the country. Then, it looks into the cost of living in each of these locations, before determining which cities are best to live in and which ones may be better to avoid. We'll be going through the entire data science pipeline, from gathering the data and organizing it, to performing exploratory data analysis and machine learning on the information collected!

Data Collection

The first step in our process is to scour the web in search of relevant data. We start with salary data from Levels.fyi. This dataset contains information about employees from various companies, levels, roles, locations, and pay grades, all collected from users who voluntarily provided their information.

To read in the dataset, we will first send an HTTP GET request using the requests library, and then we will use pandas, a Python library popular for data manipulation and analysis, to convert the JSON returned by the GET request into a dataframe.
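Here is a minimal sketch of that step, assuming the endpoint returns a JSON array of flat records. The URL constant and the two helper names are placeholders of ours, not the actual endpoint or the original notebook code.

```python
import pandas as pd
import requests

# Placeholder URL (an assumption): substitute the real JSON endpoint.
SALARY_URL = "https://example.com/salary-data.json"


def fetch_records(url: str, timeout: float = 30.0) -> list:
    """Send an HTTP GET request and return the decoded JSON payload."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.json()


def records_to_dataframe(records: list) -> pd.DataFrame:
    """Convert a list of JSON records (dicts) into a dataframe."""
    return pd.DataFrame.from_records(records)


# Usage (requires network access and a real endpoint):
# salaries = records_to_dataframe(fetch_records(SALARY_URL))
# salaries.head()
```

Keeping the download and the conversion in separate functions makes it easy to cache the raw records and rerun the conversion without hitting the site again.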

We will also be looking at a dataset from Numbeo.com, which provides information regarding the cost of living for various US cities, breaking down the total index into subdivisions like rent index or groceries index. This information is similarly collected voluntarily from users.

Like before, we will be using pandas and the requests library, but we will also need the BeautifulSoup library to parse the HTML from the response before loading it into a dataframe.
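As a rough sketch of the parsing step, the helper below turns the first table in an HTML page into a dataframe. We are assuming the cost-of-living figures sit in a plain HTML table whose first row holds the column names; the function name is ours, and the real page may need a more specific selector.

```python
import pandas as pd
from bs4 import BeautifulSoup


def table_to_dataframe(html: str) -> pd.DataFrame:
    """Parse the first <table> in an HTML document into a dataframe."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        raise ValueError("no <table> element found")

    rows = table.find_all("tr")
    # Column names come from the cells of the first row.
    header = [c.get_text(strip=True) for c in rows[0].find_all(["th", "td"])]
    # Every later row becomes one record; rows without <td> cells are skipped.
    body = [
        [c.get_text(strip=True) for c in row.find_all("td")]
        for row in rows[1:]
    ]
    body = [record for record in body if record]
    return pd.DataFrame(body, columns=header)


# Usage (requires network access; the URL is whatever page holds the table):
# import requests
# cost_of_living = table_to_dataframe(requests.get(url, timeout=30).text)
```

Every value comes back as a string, so the index columns still need to be converted to numbers afterwards.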

Data Processing

Now that we have gathered our two datasets, we will tidy up the data to make it easier to analyze. For the Levels.fyi data, we will again be needing pandas, and we're also importing the datetime module to help us convert the timestamp information into datetime objects, which will standardize the time.

For the Levels.fyi dataset, we observe that many of the numeric columns, like totalyearlycompensation, are being stored as strings in the datasets. Because we will be interpreting the values as numbers instead of strings, we will convert the values in those numeric columns to numbers.
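One way to do both conversions is sketched below. The column name totalyearlycompensation comes from the dataset; the timestamp column name and the helper itself are assumptions. The sketch also uses pandas' own pd.to_datetime rather than the datetime module, which is a substitution on our part.

```python
import pandas as pd


def coerce_types(df: pd.DataFrame, numeric_cols: list, time_col: str) -> pd.DataFrame:
    """Return a copy with numeric strings and timestamps converted."""
    out = df.copy()
    for col in numeric_cols:
        # Strings that cannot be parsed become NaN instead of raising.
        out[col] = pd.to_numeric(out[col], errors="coerce")
    # Timestamps that cannot be parsed become NaT.
    out[time_col] = pd.to_datetime(out[time_col], errors="coerce")
    return out


# Usage, with an assumed timestamp column called "timestamp":
# salaries = coerce_types(salaries, ["totalyearlycompensation"], "timestamp")
```

Using errors="coerce" keeps bad entries visible as missing values, so they can be counted and handled explicitly instead of silently breaking the conversion.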

Also, because we are computer science majors, we're only focusing on computer science jobs!
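A small sketch of that filter follows. The title column name and the list of roles are assumptions for illustration; extend the list to whichever roles you want to keep.

```python
import pandas as pd

# Illustrative list only (an assumption), not the full set of roles studied.
CS_TITLES = ["Software Engineer", "Software Engineering Manager"]


def keep_titles(df: pd.DataFrame, titles: list, title_col: str = "title") -> pd.DataFrame:
    """Keep only the rows whose job title appears in `titles`."""
    return df[df[title_col].isin(titles)].reset_index(drop=True)


# Usage:
# salaries = keep_titles(salaries, CS_TITLES)
```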

Before moving into exploratory data analysis, it is important to check our data for outliers and for any values that may have errors or may have been computed incorrectly.

From these summary statistics, we can observe there are some serious outliers. The mean base salary is $3,287,000, which indicates that a few extremely large, likely faulty values are pulling the mean upward and making the data right-skewed. The same appears to be true for stock grant value and bonus. We can also see how significant the outliers get by observing how large the max for each column is in comparison to the 75th percentile.
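The comparison between the max and the 75th percentile can be made explicit with a short helper. This is a sketch built on pandas' describe; the helper name and the max_to_q3 column are ours.

```python
import pandas as pd


def outlier_summary(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Summarize each column and compare its max to its 75th percentile."""
    stats = df[cols].describe().T  # one row per column
    # A large ratio means the largest value sits far above the bulk of the data.
    stats["max_to_q3"] = stats["max"] / stats["75%"]
    return stats[["mean", "50%", "75%", "max", "max_to_q3"]]


# Usage, with an assumed column name:
# outlier_summary(salaries, ["basesalary"])
```

A mean far above the median (the 50% column) together with a large max_to_q3 ratio is a quick sign that a handful of extreme values dominate the column.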

We want to keep these outliers from skewing our analysis, so we will trim the __ highest and lowest salaries from our dataset. Note that there is a risk of introducing bias here, since it is possible that those trimmed salaries are far from the rest of the data for a reason relevant to our analysis.
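One possible way to trim both tails is by quantile, sketched below. The fraction parameter is hypothetical and deliberately has no default, since the cutoff is not fixed above; trimming by a row count instead would work just as well.

```python
import pandas as pd


def trim_extremes(df: pd.DataFrame, col: str, fraction: float) -> pd.DataFrame:
    """Keep rows whose `col` lies between the `fraction` and
    `1 - fraction` quantiles, inclusive."""
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    low = df[col].quantile(fraction)
    high = df[col].quantile(1 - fraction)
    return df[df[col].between(low, high)]


# Usage, with an assumed column name and a cutoff chosen by the analyst:
# salaries = trim_extremes(salaries, "basesalary", fraction=chosen_fraction)
```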

Our next step in data preparation involves filtering out all jobs that are not located in the United States. We can observe that in the location column of the salaries dataset, locations in the United States are formatted as "City, State", while locations in other countries (such as Canada) are formatted as "City, State/Province, Country". Using regular expressions, we can easily filter out these rows.
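A sketch of such a filter is shown below. It assumes the state is written as a two-letter uppercase abbreviation (for example "Seattle, WA") and that the column is named location; both the pattern and the helper name are ours.

```python
import pandas as pd

# "City, ST": text without commas, one comma, then a two-letter state code.
US_LOCATION = r"^[^,]+,\s*[A-Z]{2}$"


def keep_us_rows(df: pd.DataFrame, location_col: str = "location") -> pd.DataFrame:
    """Keep rows whose location has the two-part "City, ST" form."""
    # na=False treats missing locations as non-matching.
    mask = df[location_col].str.match(US_LOCATION, na=False)
    return df[mask].reset_index(drop=True)


# Usage:
# salaries = keep_us_rows(salaries)
```

Because the pattern allows only one comma, a three-part location such as "Toronto, ON, Canada" is rejected even though it contains a two-letter code.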

Now that both datasets have been properly cleaned, the last step is to merge them together. One noticeable problem is that the Numbeo dataset does not contain all the same cities as the Levels.fyi dataset, and therefore we cannot make a simple join between the two.

Instead, since major cities have an influence on the economics of nearby towns, we will use an algorithm to join each Levels.fyi entry to a closest neighboring city that is in the Numbeo dataset, as long as they are within a reasonable distance from each other.

Step #1:

To Do

Step #2:

To Do

We will also discard any rows that cannot match a city in our algorithm, as we consider those jobs too far from major U.S. cities and we are not interested in those for the purpose of this analysis.

For the algorithm, we will make use of the Geopy library, which offers tools to locate coordinates across the globe.
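The core of such a matching step could look like the sketch below. It is one possible reading, not the algorithm used in this tutorial: it assumes each location has already been geocoded to a (latitude, longitude) pair, it uses a plain haversine formula as a dependency-free stand-in for Geopy's distance functions, and max_km is a hypothetical cutoff for "reasonable distance".

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a: tuple, b: tuple) -> float:
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def nearest_city(point: tuple, cities: dict, max_km: float):
    """Return the name of the closest city within `max_km`, or None."""
    best_name, best_dist = None, max_km
    for name, coords in cities.items():
        dist = haversine_km(point, coords)
        if dist <= best_dist:
            best_name, best_dist = name, dist
    return best_name


# Usage, where `cities` maps each cost-of-living city to its coordinates:
# match = nearest_city((lat, lon), cities, max_km=chosen_cutoff)
```

Rows for which nearest_city returns None are the ones that would be discarded. With Geopy installed, geopy.distance.geodesic(a, b).km could replace haversine_km for a more accurate ellipsoidal distance.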

Data Exploration and Analysis

Now that our datasets are clean and organized together, it is time to analyze them!

We will be using Plotly and Seaborn to construct various plots and figures to examine the dataset. We will also use Folium to render a map visual of the data based on location.

First, we will examine the relationship between years of experience and base salary using a scatterplot.

It looks like there's a positive linear relationship between years of experience and base salary!
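A quick numeric check can back up what the scatterplot suggests. The sketch below computes the Pearson correlation between the two columns; the column names yearsofexperience and basesalary are assumptions about the dataframe.

```python
import pandas as pd


def experience_salary_correlation(
    df: pd.DataFrame,
    x: str = "yearsofexperience",
    y: str = "basesalary",
) -> float:
    """Pearson correlation between two columns, ignoring incomplete rows."""
    pairs = df[[x, y]].dropna()
    return pairs[x].corr(pairs[y])  # Series.corr defaults to Pearson


# Usage:
# experience_salary_correlation(salaries)
```

A value near 1 would support a strong positive linear relationship, while a value near 0 would suggest the trend seen in the plot is weak.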

However, does this relationship extend for all job positions? Or are there specific Computer Science roles that have a different relationship between base salary and years of experience?

Below, we plot the years of experience against the base salary of the 7 roles we are looking to study. We will create separate scatterplots for the roles, and will plot a line of best fit to see if there is some type of linear relationship between the two for each role.
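The line of best fit for each role can also be computed directly, which makes the slopes easy to compare. This is a sketch using an ordinary least-squares fit from NumPy, not necessarily how the plots here were produced, and the column names are assumptions.

```python
import numpy as np
import pandas as pd


def fit_by_role(
    df: pd.DataFrame,
    role_col: str = "title",
    x: str = "yearsofexperience",
    y: str = "basesalary",
) -> pd.DataFrame:
    """Fit salary = slope * experience + intercept separately for each role."""
    rows = []
    for role, group in df.groupby(role_col):
        pts = group[[x, y]].dropna()
        if pts[x].nunique() < 2:
            continue  # a line needs at least two distinct x values
        slope, intercept = np.polyfit(
            pts[x].to_numpy(dtype=float), pts[y].to_numpy(dtype=float), deg=1
        )
        rows.append(
            {"role": role, "slope": slope, "intercept": intercept, "n": len(pts)}
        )
    return pd.DataFrame(rows)


# Usage:
# fit_by_role(salaries).sort_values("slope")
```

The slope column is the estimated change in base salary per additional year of experience for that role.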

{% include figure1.html %}

Based on the above plots, we can see a positive correlation between years of experience and base salary for each role. It looks like the base salary increases relatively slowly over years of experience for roles like Software Engineering Manager, compared to roles like Software Engineer.

Now we want to narrow our focus to specific cities and companies. We'll be looking at the average base salary per US city and per company in our dataset.

First, we calculate the average base salary per city and company in the dataframe.
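A sketch of that calculation with a pandas groupby is shown below. The column names location, company, and basesalary are assumptions about the merged dataframe, and the helper name is ours.

```python
import pandas as pd


def average_base_salary(
    df: pd.DataFrame, by: str, salary_col: str = "basesalary"
) -> pd.DataFrame:
    """Average base salary per group, highest average first."""
    return (
        df.groupby(by)[salary_col]
        .mean()
        .sort_values(ascending=False)
        .rename("avg_basesalary")
        .reset_index()
    )


# Usage:
# by_city = average_base_salary(salaries, by="location")
# by_company = average_base_salary(salaries, by="company")
```

Groups with only a handful of entries can produce misleading averages, so it may be worth also counting the rows in each group before comparing them.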